Dynamic programming: 2014

Friday, May 30, 2014

Optical character reorganization system

SUMMARY

In the running world, there is growing demand for the software systems to recognize

characters in computer system when information is scanned through paper documents as we

know that we have number of newspapers and books which are in printed format related to

different subjects. These days there is a huge demand in storing the information available in

these paper documents in to a computer storage disk and then later reusing this information by

searching process. One simple way to store information in these paper documents in to

computer system is to first scan the documents and then store them as IMAGES. But to reuse

this information it is very difficult to read the individual contents and searching the contents

form these documents line-by-line and word-by-word. The reason for this difficulty is the font

characteristics of the characters in paper documents are different to font of the characters in

computer system. As a result, computer is unable to recognize the characters while reading

them.

Thus our need is to develop character recognition software system to perform Document

Image Analysis which transforms documents in paper format to electronic format. For this

process there are various techniques in the world. Among all those techniques we have chosen

Optical Character Recognition as main fundamental technique to recognize characters. OCR

thus derives the meaning of characters, their font properties from their bit-mapped images.

PROBLEM STATEMENT

In the running world there is a growing demand for the users to convert the printed documents

in to electronic documents for maintaining the security of their data. Hence the basic OCR

system was invented to convert the data available on papers in to computer process able

documents, So that the documents can be editable and reusable. The existing system/the

previous system of OCR on a grid infrastructure is just OCR without grid functionality. That is

the existing system deals with the homogeneous character recognition or character recognition

of single languages.

The drawback in the early OCR systems is that they only have the capability to convert and

recognize only the documents of English or a specific language only. That is, the older OCR

system is uni-lingual.

OVERVIEW OF PROPOSED SOLUTION APPROACH

The problem is for the software systems to recognize characters in computer system when

information is scanned through paper documents as we know that we have number of

newspapers and books which are in printed format related to different subjects. Whenever

we scan the documents through the scanner, the documents are stored as images such as

jpeg, gif etc., in the computer system. These images cannot be read or edited by the user.

But to reuse this information it is very difficult to read the individual contents and the

contents form these documents line-by-line and word-by-word. These days there is a huge

demand in “storing the information available in these paper documents in to a computer

storage disk and then later editing or reusing this information by searching process”.

INTEGRATED SUMMARY OF THE LITERATURE STUDIED

Neural network play a very important role in Artificial intelligence. For the neural network training is

important part it provide ability to guess the best Answer. There are two type of training for neural

networks

 Un-Supervised Training

 Supervised Training

Supervised training provides the neural network with training sets and the anticipated output. Unsupervised

training supplies the neural network with training sets, but there is no anticipated output provided

UNSUPERVISED TRAINING

Unsupervised training is a very common training technique for Kohonen neural networks. What is meant by

training without supervision is that the neural network is provided with training sets, which are collections of

defined input values. But the unsupervised neural network is not provided with anticipated outputs.

Unsupervised training is usually used in a classification neural network. A classification neural network takes

input patterns, which are presented to the input neurons. These input patterns are then processed, and one

single neuron on the output layer fires. This firing neuron can be thought of as the classification of which

group the neural input pattern belonged to. Handwriting recognition is a good application of a classification

neural network.

The input patterns presented to the Kohonen neural network are the dot image of the character that was

hand written. We may then have 26 output neurons, which correspond to the 26 letters of the English alphabet.

The Kohonen neural network should classify the input pattern into one of the 26 input patterns.

During the training process the Kohonen neural network in handwritten recognition is presented with 26 input

patterns. The network is configured to also have 26 output patterns. As the Kohonen neural network is trained

the weights should be adjusted so that the input patterns are classified into the 26 output neurons. This

technique results in a relatively effective method for character recognition.

SUPERVISED TRAINING

The supervised training method is similar to the unsupervised training method in that training sets are

provided. Just as with unsupervised training these training sets specify input signals to the neural network.

The primary difference between supervised and unsupervised training is that in supervised training the

expected outputs are provided. This allows the supervised training algorithm to adjust the weight matrix based

on the difference between the anticipated output of the neural network, and the actual output.

There are several popular training algorithms that make use of supervised training. One of the most

common is the back-propagation algorithm. It is also possible to use an algorithm such as simulated annealing

or a genetic algorithm to implement supervised training.

FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTS

FUNCTIONAL REQUIREMENTS

1. The product shall be able to recognize the input character

2. The product shall be able to load the image form the local storage where they are stored.

3. The product shall be able to edit the image correctly.

4. The product should be proper character recognition system.

5. Product shell recognizes the image clearly.

NON FUNCTIONAL REQUIREMENT

1. Reliable: The apps should display authentic and right guidelines/information regarding

information security and should not mislead. Failures/errors should be debug able .

2. Performance: Speed should be efficiently high. The apps/buttons should respond within

1-5s

3. Safety: The apps should not cause any damage to human in any form.

ARCHITECTURE OF THE PROPOSED SYSTEM

The Architecture of the optical character recognition system on a grid infrastructure consists of the three

main components. They are:-

 Scanner

 OCR Hardware or Software

 Output Interface

MODULES AND THEIR FUNCTIONALISTS

Our software system Optical Character Recognition on a grid infrastructure can be divided into five

modules based on its functionality.The modules classified are as follows:-

 Document Processing Module

 System Training Module.

 Document Recognition Module.

 Document Editing Module and

 Document Searching Module.

FINDINGS

 System can detect the character if there is a slight difference in training set character and input set of

character.

 This system more stable as compared to other existing systems.

 Compare with other systems, this provides more correct output.

CONCLUSION

What does the future hold for OCR? Given enough entrepreneurial designers and sufficient research and

development dollars, OCR can become a powerful tool for future data entry applications. However, the

limited availability of funds in a capital-short environment could restrict the growth of this technology. But,

given the proper impetus and encouragement, a lot of benefits can be provided by the OCR system. They are:-

 The automated entry of data by OCR is one of the m ost attractive, labor reducing technology

 The recognition of new font characters by the system is very easy and quick.

 We can edit the information of the documents more conveniently and we can reuse the edited information

as and when required.

 The extension to software other than editing and searching is topic for future works.

The Grid infrastructure used in the implementation of Optical Character Recognition system can be

efficiently used to speed up the translation of image based documents into structured documents that are

currently easy to discover, search and process

FUTURE WORK

The Optical Character Recognition software can be enhanced in the future in different kinds of ways

such as:

 Training and recognition speeds can be increased greater and greater by making it more user-friendly.

 Many applications exist where it would be desirable to read handwritten entries. Reading handwriting is a

very difficult task considering the diversities that exist in ordinary penmanship. However, progress is

being made.

for the full presentation --> http://www.slideshare.net/IAMINURHEARTS1/ocr-ppt-35272335

for youtube presentation -->https://www.youtube.com/watch?v=AYRTcxPvw04&feature=youtu.be

Thursday, April 24, 2014

Randomness

You’re given a string S of N characters. It’s known that the string consists of lowercase latin letters. The string is generated randomly. That means that every symbol is chosen randomly and independently from others from the set {‘a’, ‘b’, …, ‘z’}. All the letters has equal probability to appear.

You’re given Q queries on this string. Each query is of the form P C, where P is an integer between 1 and N (both inclusive) and C is a character from the set {‘a’, ‘b’, …, ‘z’}. Both P and C were chosen at random and independently from other queries.

When you have a query of the form P C you have to change the Pth symbol of S to C. After every change we ask you to output the number of distinct nonempty sub-strings of S.

Input Format
The first line of input consists of two single space separated integers N and Q - the length of the string S and the number of queries respectively.

The second line contains the string S itself.

The following Q lines describe the queries in the form P C, where P and C are also separated with a single space.

Constraints

4 ≤ N ≤ 75000
4 ≤ Q ≤ 75000

Output Format
Output Q lines. Output the number of distinct substrings of S after the ith query on theith line of the output.

Sample Input

4 4 
aaab
1 a
2 b
3 c
4 d

Sample Output

Explanation

after replacing the character at 1st index with a, we still have the original string aaab. The total non empty substrings of aaab are

a b aa ab aaa aab aaab

hence 7.

after replacing the character at 2nd index with b, we have the string abab. The total non empty substrings of abab are

a b ab ba aba bab abab

hence 7.

after replacing the character at 3rd index with c, we have the string abcb. The total non empty substrings of abcb are

a b c ab bc cb abc bcb abcb

hence 9.

after replacing the character at 4th index with d, we have the string abcd. The total non empty substrings of abcd are

a b c d ab bc cd abc bcd abcd

hence 10.

Scoring

There are 12 test cases.

The first four test cases N = 100 and Q = 500
The next four test cases N = 75000 and Q = 10
The last four test cases N = 75000 and Q = 75000

solution-->

The solution consists of constructing the suffix array and then finding the number of distinct substrings based on the Longest Common Prefixes.

One key observation here is that:

If you look through the prefixes of each suffix of a string, you have covered all substrings of that string.

Let us take an example: BANANA

Suffixes are:

0) BANANA
1) ANANA
2) NANA
3) ANA
4) NA
5) A

It would be a lot easier to go through the prefixes if we sort the above set of suffixes, as we can skip the repeated prefixes easily.

Sorted set of suffixes:

5) A
3) ANA
1) ANANA
0) BANANA
4) NA
2) NANA

From now on,

LCP = Longest Common Prefix of 2 strings.

Initialize

ans = length(first suffix) = length("A") = 1.

Now consider the consecutive pairs of suffixes, i.e, [A, ANA], [ANA, ANANA], [ANANA, BANANA], etc. from the above set of sorted suffixes.

We can see that,

LCP("A", "ANA") = "A".

All characters that are not part of the common prefix contribute to a distinct substring. In the above case, they are 'N' and 'A'. So they should be added to ans.

So we have,

1 2	ans += length("ANA") - LCP("A", "ANA") ans = ans + 3 - 1 = ans + 2 = 3

Do the same for the next pair of consecutive suffixes: ["ANA", "ANANA"]

LCP("ANA", "ANANA") = "ANA".

ans += length("ANANA") - length(LCP)

=> ans = ans + 5 - 3

=> ans = 3 + 2 = 5.

Similarly, we have:

1 2	LCP("ANANA", "BANANA") = 0 ans = ans + length("BANANA") - 0 = 11

1 2	LCP("BANANA", "NA") = 0 ans = ans + length("NA") - 0 = 13

1 2	LCP("NA", "NANA") = 2 ans = ans + length("NANA") - 2 = 15

Hence the number of distinct substrings for the string "BANANA" = 15.

code---->

#include<iostream>

#include<string>

#include<algorithm>

using namespace std ;

int LCP(string s , string s1){

int x=s.size();

int y=s1.size();

if(x>y)

x=y;

int count=0 ;

for(int i=0 ; i < x ;i++){

if(s.at(i)==s1.at(i))

count++;

else

break ;

}

return count ;

}

void check(string s){

int x= s.size();

string ss[x];

for(int i=0 ; i<x ;i++)

ss[i]= s.substr(i);

sort(ss,ss+x);

int ans=0;

ans=ss[0].size();

for(int i=1 ; i<x ; i++)

ans+=( ss[i].size() - LCP(ss[i],ss[i-1]));

cout << ans << endl ;

}

int main(){

int x=0,y=0 ;

cin >> x >> y ;

string s ;

cin >> s ;

while(y--){

int p ;

char z ;

cin >>p >> z ;

s[p-1]=z;

check(s);

}

return 0 ;

}

Tuesday, April 22, 2014

Volleyball Match

Tatyana is a big sports fan and she likes volleyball a lot! She writes down the final scores of the game after it has ended in her notebook.

If you are not familiar with the rules of volleyball, here’s a brief:

2 teams play in total
During the course of the game, each team gets points, and thus increases its score by 1.
The initial score is 0 for both teams.

The game ends when

One of the teams gets 25 points and another team has < 24 points ( strictly less than 24).
If the score ties at 24:24, the teams continue to play until the absolute difference between the scores is 2.

Given the final score of a game in the format A:B i.e., the first team has scored Apoints and the second has scored B points, can you print the number of different sequences of getting points by teams that leads to this final score?

Input Format
The first line contains A and the second line contains B.

Constraints

0 ≤ A , B ≤ 109

Output Format
Output the number of different sequences of getting points by the teams that leads to the final score A : B. Final means that the game should be over after this score is reached. If the number is larger than 109+7, output number modulo 109 + 7. Print 0 if no such volleyball game ends with the given score.

Example input #00

3
25

Example output #00

Example input #01

24
17

Example output #01

Explanation #01

There’s no game of volleyball that ends with a score of 24 : 17.

solution -->

Lets consider there are r−1 blocks, so that, there will be r spaces to be filled (including the left of the left most block and right of the right most block). Now,

n identical things should be filled in these spaces. There are

n+r−1 things in that line now, which can be arranged in

(n+r−1)!ways.

But, we should avoid the arrangements between the

n things and

r−1 blocks, since they are identical.

So, the final answer is

( n + r - 1 ) ! n ! ( r - 1 ) ! = C (n + r - 1, r - 1) .

code-->

#include<iostream>
using namespace std ;

long long fact[100];
int p = 1000000007 ;

long long pow(long long a , int b){
long long x=1 , y=a;
while(b){
if(b%2==1){
x=x*y;
x=x%p;
}
y=y*y;
y=y%p ;
b=b/2;
}
return x ;

}
int max(int a , int b ){
return a>b?a:b ;
}

int main(){

int a,b ;

cin >> a >> b ;

fact[0]=1 ;

for(int i=1 ; i<100 ;i++)
fact[i]=(fact[i-1]*(long long)i)%p;


if( (a<25 && b<25) || abs(a-b)<2){
cout <<"0";
return 0 ;
}

if(a>=24 && b>=24){

if(abs(a-b)!=2){
cout <<"0";
return 0 ;
}

int z =max(a,b)-2;
z=z-24;
long long q= (fact[24]*fact[24])%p ;
long long ans =(fact[24+25-1]*pow(q,p-2))%p;

long long zz=pow(2,z);

    ans=(ans*zz)%p;

cout << ans << endl ;
return 0 ;
}


int y;
if(a>b)
y=b;
else
y=a;

long long z = (fact[y]*fact[25-1])%p ;
long long ans=(fact[25+y-1]*pow(z,p-2))%p;

cout << ans << endl ;

return 0 ;
}