Friday, May 30, 2014

Optical character reorganization system

SUMMARY


In the running world, there is growing demand for the software systems to recognize
characters in computer system when information is scanned through paper documents as we
know that we have number of newspapers and books which are in printed format related to
different subjects. These days there is a huge demand in storing the information available in
these paper documents in to a computer storage disk and then later reusing this information by
searching process. One simple way to store information in these paper documents in to
computer system is to first scan the documents and then store them as IMAGES. But to reuse
this information it is very difficult to read the individual contents and searching the contents
form these documents line-by-line and word-by-word. The reason for this difficulty is the font
characteristics of the characters in paper documents are different to font of the characters in
computer system. As a result, computer is unable to recognize the characters while reading
them.


Thus our need is to develop character recognition software system to perform Document
Image Analysis which transforms documents in paper format to electronic format. For this
process there are various techniques in the world. Among all those techniques we have chosen
Optical Character Recognition as main fundamental technique to recognize characters. OCR
thus derives the meaning of characters, their font properties from their bit-mapped images.



PROBLEM STATEMENT


In the running world there is a growing demand for the users to convert the printed documents
in to electronic documents for maintaining the security of their data. Hence the basic OCR
system was invented to convert the data available on papers in to computer process able
documents, So that the documents can be editable and reusable. The existing system/the
previous system of OCR on a grid infrastructure is just OCR without grid functionality. That is
the existing system deals with the homogeneous character recognition or character recognition
of single languages.
The drawback in the early OCR systems is that they only have the capability to convert and
recognize only the documents of English or a specific language only. That is, the older OCR
system is uni-lingual.



OVERVIEW OF PROPOSED SOLUTION APPROACH

The problem is for the software systems to recognize characters in computer system when
information is scanned through paper documents as we know that we have number of
newspapers and books which are in printed format related to different subjects. Whenever
we scan the documents through the scanner, the documents are stored as images such as
jpeg, gif etc., in the computer system. These images cannot be read or edited by the user.
But to reuse this information it is very difficult to read the individual contents and the
contents form these documents line-by-line and word-by-word. These days there is a huge
demand in “storing the information available in these paper documents in to a computer
storage disk and then later editing or reusing this information by searching process”.



INTEGRATED SUMMARY OF THE LITERATURE STUDIED

Neural network play a very important role in Artificial intelligence. For the neural network training is
important part it provide ability to guess the best Answer. There are two type of training for neural
networks
 Un-Supervised Training
 Supervised Training
Supervised training provides the neural network with training sets and the anticipated output. Unsupervised
training supplies the neural network with training sets, but there is no anticipated output provided

UNSUPERVISED TRAINING

Unsupervised training is a very common training technique for Kohonen neural networks. What is meant by
training without supervision is that the neural network is provided with training sets, which are collections of
defined input values. But the unsupervised neural network is not provided with anticipated outputs.
Unsupervised training is usually used in a classification neural network. A classification neural network takes
input patterns, which are presented to the input neurons. These input patterns are then processed, and one
single neuron on the output layer fires. This firing neuron can be thought of as the classification of which
group the neural input pattern belonged to. Handwriting recognition is a good application of a classification
neural network.
The input patterns presented to the Kohonen neural network are the dot image of the character that was
hand written. We may then have 26 output neurons, which correspond to the 26 letters of the English alphabet.
The Kohonen neural network should classify the input pattern into one of the 26 input patterns.
During the training process the Kohonen neural network in handwritten recognition is presented with 26 input
patterns. The network is configured to also have 26 output patterns. As the Kohonen neural network is trained
the weights should be adjusted so that the input patterns are classified into the 26 output neurons. This
technique results in a relatively effective method for character recognition.


SUPERVISED TRAINING

The supervised training method is similar to the unsupervised training method in that training sets are
provided. Just as with unsupervised training these training sets specify input signals to the neural network.
The primary difference between supervised and unsupervised training is that in supervised training the
expected outputs are provided. This allows the supervised training algorithm to adjust the weight matrix based
on the difference between the anticipated output of the neural network, and the actual output.
There are several popular training algorithms that make use of supervised training. One of the most
common is the back-propagation algorithm. It is also possible to use an algorithm such as simulated annealing
or a genetic algorithm to implement supervised training.


FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTS

FUNCTIONAL REQUIREMENTS
1. The product shall be able to recognize the input character
2. The product shall be able to load the image form the local storage where they are stored.
3. The product shall be able to edit the image correctly.
4. The product should be proper character recognition system.
5. Product shell recognizes the image clearly.

NON FUNCTIONAL REQUIREMENT

1. Reliable: The apps should display authentic and right guidelines/information regarding
information security and should not mislead. Failures/errors should be debug able .
2. Performance: Speed should be efficiently high. The apps/buttons should respond within
1-5s
3. Safety: The apps should not cause any damage to human in any form.


ARCHITECTURE OF THE PROPOSED SYSTEM

The Architecture of the optical character recognition system on a grid infrastructure consists of the three
main components. They are:-
 Scanner
 OCR Hardware or Software
 Output Interface




MODULES AND THEIR FUNCTIONALISTS

Our software system Optical Character Recognition on a grid infrastructure can be divided into five
modules based on its functionality.The modules classified are as follows:-
 Document Processing Module
 System Training Module.
 Document Recognition Module.
 Document Editing Module and
 Document Searching Module.



FINDINGS

 System can detect the character if there is a slight difference in training set character and input set of
character.
 This system more stable as compared to other existing systems.
 Compare with other systems, this provides more correct output.

CONCLUSION

What does the future hold for OCR? Given enough entrepreneurial designers and sufficient research and
development dollars, OCR can become a powerful tool for future data entry applications. However, the
limited availability of funds in a capital-short environment could restrict the growth of this technology. But,
given the proper impetus and encouragement, a lot of benefits can be provided by the OCR system. They are:-
 The automated entry of data by OCR is one of the m ost attractive, labor reducing technology
 The recognition of new font characters by the system is very easy and quick.
 We can edit the information of the documents more conveniently and we can reuse the edited information
as and when required.
 The extension to software other than editing and searching is topic for future works.
The Grid infrastructure used in the implementation of Optical Character Recognition system can be
efficiently used to speed up the translation of image based documents into structured documents that are
currently easy to discover, search and process


FUTURE WORK

The Optical Character Recognition software can be enhanced in the future in different kinds of ways
such as:
 Training and recognition speeds can be increased greater and greater by making it more user-friendly.
 Many applications exist where it would be desirable to read handwritten entries. Reading handwriting is a
very difficult task considering the diversities that exist in ordinary penmanship. However, progress is
being made.


for the full presentation --> http://www.slideshare.net/IAMINURHEARTS1/ocr-ppt-35272335

for youtube presentation -->https://www.youtube.com/watch?v=AYRTcxPvw04&feature=youtu.be

Thursday, April 24, 2014

Randomness

You’re given a string S of N characters. It’s known that the string consists of lowercase latin letters. The string is generated randomly. That means that every symbol is chosen randomly and independently from others from the set {‘a’, ‘b’, …, ‘z’}. All the letters has equal probability to appear.
You’re given Q queries on this string. Each query is of the form P C, where P is an integer between 1 and N (both inclusive) and C is a character from the set {‘a’, ‘b’, …, ‘z’}. Both P and C were chosen at random and independently from other queries.
When you have a query of the form P C you have to change the Pth symbol of S to C. After every change we ask you to output the number of distinct nonempty sub-strings of S.
Input Format
The first line of input consists of two single space separated integers N and Q - the length of the string S and the number of queries respectively.
The second line contains the string S itself.
The following Q lines describe the queries in the form P C, where P and C are also separated with a single space.
Constraints
4 ≤ N ≤ 75000
4 ≤ Q ≤ 75000
Output Format
Output Q lines. Output the number of distinct substrings of S after the ith query on theith line of the output.
Sample Input
4 4 
aaab
1 a
2 b
3 c
4 d
Sample Output
7
7
9
10
Explanation
after replacing the character at 1st index with a, we still have the original string aaab. The total non empty substrings of aaab are
a b aa ab aaa aab aaab
hence 7.
after replacing the character at 2nd index with b, we have the string abab. The total non empty substrings of abab are
a b ab ba aba bab abab
hence 7.
after replacing the character at 3rd index with c, we have the string abcb. The total non empty substrings of abcb are
a b c ab bc cb abc bcb abcb
hence 9.
after replacing the character at 4th index with d, we have the string abcd. The total non empty substrings of abcd are
a b c d ab bc cd abc bcd abcd
hence 10.
Scoring
There are 12 test cases.
The first four test cases N = 100 and Q = 500
The next four test cases N = 75000 and Q = 10
The last four test cases N = 75000 and Q = 75000

solution-->
Thsolution consists of constructing the suffix array and then finding the number of distinct substrings based on the Longest Common Prefixes.

One key observation here is that:
If you look through the prefixes of each suffix of a string, you have covered all substrings of that string.


Let us take an example: BANANA

Suffixes are:
0) BANANA
1) ANANA
2) NANA
3) ANA
4) NA
5) A

It would be a lot easier to go through the prefixes if we sort the above set of suffixes, as we can skip the repeated prefixes easily.

Sorted set of suffixes:
5) A
3) ANA
1) ANANA
0) BANANA
4) NA
2) NANA

From now on, 
LCP = Longest Common Prefix of 2 strings.

Initialize
ans = length(first suffix) = length("A") = 1.


Now consider the consecutive pairs of suffixes, i.e, [A, ANA], [ANA, ANANA], [ANANA, BANANA], etc. from the above set of sorted suffixes.

We can see that,
LCP("A", "ANA") = "A".


All characters that are not part of the common prefix contribute to a distinct substring. In the above case, they are 'N' and 'A'. So they should be added to ans.

So we have, 
1
2
ans += length("ANA") - LCP("A", "ANA") 
ans = ans + 3 - 1 = ans + 2 = 3


Do the same for the next pair of consecutive suffixes: ["ANA", "ANANA"]
1
2
3
4
LCP("ANA", "ANANA") = "ANA".
ans += length("ANANA") - length(LCP)
=> ans = ans + 5 - 3
=> ans = 3 + 2 = 5.


Similarly, we have:
1
2
LCP("ANANA", "BANANA") = 0
ans = ans + length("BANANA") - 0 = 11

1
2
LCP("BANANA", "NA") = 0
ans = ans + length("NA") - 0 = 13

1
2
LCP("NA", "NANA") = 2
ans = ans + length("NANA") - 2 = 15


Hence the number of distinct substrings for the string "BANANA" = 15.

code---->

#include<iostream>
#include<string>
#include<algorithm>
using namespace std ;

int LCP(string s , string s1){
int x=s.size();
int y=s1.size();
if(x>y)
    x=y;
    int count=0 ;
    for(int i=0 ; i < x ;i++){
        if(s.at(i)==s1.at(i))
            count++;
        else
         break ;
    }
return count ;
}

void check(string s){
    int x= s.size();
    string ss[x];
    
    for(int i=0 ; i<x ;i++)
        ss[i]= s.substr(i);     
         sort(ss,ss+x);
         int ans=0;
         ans=ss[0].size();
      for(int i=1 ; i<x ; i++)
    ans+=( ss[i].size() - LCP(ss[i],ss[i-1]));
    cout << ans << endl ;    
}

int main(){
    int x=0,y=0 ;   
    cin >> x >> y ;
     string s ;
    cin >> s ;

    while(y--){
          int p ;
          char z ;
         cin >>p >> z ;
         s[p-1]=z;
         check(s);
     }
    return 0 ;   
}

Tuesday, April 22, 2014

Volleyball Match

Tatyana is a big sports fan and she likes volleyball a lot! She writes down the final scores of the game after it has ended in her notebook.
If you are not familiar with the rules of volleyball, here’s a brief:
  • 2 teams play in total
  • During the course of the game, each team gets points, and thus increases its score by 1.
  • The initial score is 0 for both teams.
The game ends when
  • One of the teams gets 25 points and another team has < 24 points ( strictly less than 24).
  • If the score ties at 24:24, the teams continue to play until the absolute difference between the scores is 2.
Given the final score of a game in the format A:B i.e., the first team has scored Apoints and the second has scored B points, can you print the number of different sequences of getting points by teams that leads to this final score?
Input Format
The first line contains A and the second line contains B.
Constraints
0 ≤ A , B ≤ 109
Output Format
Output the number of different sequences of getting points by the teams that leads to the final score A : B. Final means that the game should be over after this score is reached. If the number is larger than 109+7, output number modulo 109 + 7. Print 0 if no such volleyball game ends with the given score.
Example input #00
3
25
Example output #00
2925
Example input #01
24
17
Example output #01
0
Explanation #01
There’s no game of volleyball that ends with a score of 24 : 17.
solution -->

Lets consider there are r1 blocks, so that, there will be r spaces to be filled (including the left of the left most block and right of the right most block). Now, n identical things should be filled in these spaces. There are n+r1 things in that line now, which can be arranged in (n+r1)!ways.
But, we should avoid the arrangements between the n things and r1 blocks, since they are identical.
So, the final answer is


(n+r1)!n!(r1)!=C(n+r1,r1).
code-->

#include<iostream>
using namespace std ;

long long fact[100];
int  p = 1000000007 ;

long long pow(long long a , int b){
    long long x=1 , y=a; 
     while(b){
         if(b%2==1){
         x=x*y;
         x=x%p;
     }
     y=y*y;
    y=y%p ;
    b=b/2;
     }
    return x ;
    
}
int max(int a , int b ){
    return a>b?a:b ;
}

int main(){
    
int a,b ;
    
    cin >> a >> b ;
    
    fact[0]=1 ;
    
    for(int i=1 ; i<100 ;i++)
        fact[i]=(fact[i-1]*(long long)i)%p;
    
    
    if( (a<25 && b<25) || abs(a-b)<2){
        cout <<"0";
        return  0 ;
    }
    
    if(a>=24 && b>=24){
        
        if(abs(a-b)!=2){
             cout <<"0";
        return  0 ;
        }
        
        int z =max(a,b)-2;
        z=z-24;
        long long q= (fact[24]*fact[24])%p ;
        long long ans =(fact[24+25-1]*pow(q,p-2))%p;
        
   long long zz=pow(2,z);        

            ans=(ans*zz)%p;
        
        cout << ans << endl ;
        return 0 ;
          }
    
    
        int y;
        if(a>b)
            y=b;
        else
           y=a; 
  
            long long z = (fact[y]*fact[25-1])%p  ;
        long long ans=(fact[25+y-1]*pow(z,p-2))%p;       
         
    cout << ans << endl ;
    
    return 0 ;
}