top button
Flag Notify
    Connect to us
      Site Registration

Site Registration

Boolean Search using Strings in python?

+1 vote
340 views

I want to do the Boolean search over various sentences or documents. I do not want to use special programs like Whoosh, etc.

May I use any other parser? If anybody may kindly let me know.

posted Apr 23, 2015 by Vijay Shukla

Share this question
Facebook Share Button Twitter Share Button LinkedIn Share Button

1 Answer

0 votes

Here is an old one I wrote. Good for small collections of documents and uncomplicated queries.

https://github.com/jackdied/boolmatch

answer Apr 23, 2015 by Alok Sharma
Similar Questions
+1 vote

I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

How should I approach this? Seems like the straightforward way to do it would be to loop through each of the files and go line by line comparing all the terms to the query, but this seems like it would take too long.

Can someone give me a suggestion as to how to minimize the search time?

+2 votes

I need to search through a directory of text files for a string. Here is a short program I made in the past to search through a single text file for a line of text.

How can I modify the code to search through a directory of files that have different filenames, but the same extension?

fname = raw_input("Enter file name: ") #"*.txt"
fh = open(fname)
lst = list()
biglst=[]
for line in fh:
 line=line.rstrip()
 line=line.split()
 biglst+=line
final=[]
for out in biglst:
 if out not in final:
 final.append(out)
final.sort()
print (final)
0 votes

I've got a 170 MB file I want to search for lines that look like:

INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
 m = pattern.search(line)
 if m:
 count += 1

print count

If I add a pre-filter before the regex, it runs in 0.78 seconds (about twice the speed!)

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
 if 'ENQ' not in line:
 continue
 m = pattern.search(line)
 if m:
 count += 1

print count

Every line which contains 'ENQ' also matches the full regex (61425 lines match, out of 2.1 million total). I don't understand why the first way is so much slower.

Once the regex is compiled, you should have a state machine pattern matcher. It should be O(n) in the length of the input to figure out that it doesn't match as far as "ENQ". And that's exactly how long it should take for "if 'ENQ' not in line" to run as well. Why is doing twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.

0 votes

As per specification, System Information block 1 (SIB1) can broadcast up to six PLMNs in a cell. My question is when I do manual network search at UE, will UE show all the PLMNs or only first PLMN (primary PLMN) in the output ?

+1 vote

Need to implement searching technology for whole website using index,
Required alogrithm or logic for that in php

...